Missing data are ubiquitous in real-world applications and, if not adequately handled, may lead to loss of information and biased findings in downstream analyses. In particular, high-dimensional incomplete data with a moderate sample size, such as multi-omics data, present daunting challenges. Imputation is arguably the most popular way to handle missing data, yet existing imputation methods have a number of limitations. Single imputation methods, such as matrix completion, do not adequately account for imputation uncertainty and hence yield improper statistical inference. In contrast, multiple imputation (MI) methods allow for proper inference, but existing MI methods do not perform well in high-dimensional settings. Our work aims to address these significant methodological gaps by leveraging recent advances in neural network Gaussian processes (NNGP) from a Bayesian viewpoint. We propose two NNGP-based MI methods, collectively termed MI-NNGP, that draw multiple imputations of missing values from a joint (posterior predictive) distribution. The MI-NNGP methods are shown to significantly outperform existing state-of-the-art methods on synthetic and real datasets, in terms of imputation error, statistical inference, robustness to the missing rate, and computation cost, under three missing data mechanisms: MCAR, MAR, and MNAR.
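The abstract stresses that MI, unlike single imputation, supports proper inference. For context, here is a minimal sketch of the standard Rubin's-rules pooling that multiple imputation relies on; this is generic MI machinery, not anything specific to MI-NNGP.

```python
import numpy as np

# Standard Rubin's-rules pooling across M completed-data analyses.
# Given M estimates q_m with variances u_m:
#   pooled estimate  q_bar = mean(q_m)
#   within variance  W     = mean(u_m)
#   between variance B     = sample variance of q_m (ddof=1)
#   total variance   T     = W + (1 + 1/M) * B

def rubins_rules(estimates, variances):
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    q_bar = estimates.mean()
    W = variances.mean()
    B = estimates.var(ddof=1)
    T = W + (1 + 1 / M) * B
    return q_bar, T

# Example: 5 imputations of a regression coefficient and its variance.
q_bar, T = rubins_rules([0.31, 0.28, 0.35, 0.30, 0.33],
                        [0.010, 0.012, 0.011, 0.009, 0.010])
print(q_bar, np.sqrt(T))  # pooled coefficient and its standard error
```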
Adversarial perturbation plays a significant role in the field of adversarial robustness; it is obtained by solving a maximization problem over the input data. We show that the backward propagation of this optimization can be accelerated by $2\times$ (and thus the overall optimization, including the forward propagation, by $1.5\times$) without any utility drop, if we compute only the output gradient but not the parameter gradient during the backward propagation.
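A minimal PyTorch sketch (an illustration under my own assumptions, not the paper's implementation) of requesting only the input gradient during the backward pass, so that parameter gradients are never computed or stored:

```python
import torch
import torch.nn as nn

# When crafting an adversarial perturbation we only need d(loss)/d(input),
# so we ask autograd for the input gradient alone instead of calling
# loss.backward(), which would also compute gradients for every parameter.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
for p in model.parameters():
    p.requires_grad_(False)  # parameters are frozen during the attack

x = torch.randn(32, 784, requires_grad=True)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)

# Only the gradient w.r.t. x is materialized; parameter gradients are skipped.
(grad_x,) = torch.autograd.grad(loss, x)
x_adv = (x + 8 / 255 * grad_x.sign()).detach()  # e.g., one FGSM-style step
```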
Electronic health records (EHRs) offer great promise for advancing precision medicine while also posing significant analytical challenges. In particular, due to government regulations and/or institutional policies, patient-level data in EHRs often cannot be shared across institutions (data sources). As a result, there is growing interest in distributed learning across multiple EHR databases without sharing patient-level data. To address these challenges, we propose a novel communication-efficient method that aggregates local optimal estimates by recasting the problem as a missing data problem. In addition, we propose incorporating posterior samples from remote sites, which can provide partial information on the missing quantities and improve the efficiency of parameter estimation, while possessing a differential privacy property that reduces the risk of information leakage. The proposed method enables proper statistical inference without sharing raw patient-level data and can accommodate sparse regression. We provide a theoretical study of the asymptotic properties of the proposed method with respect to statistical inference and differential privacy, and evaluate its performance in simulations and real data analyses against several recently developed methods.
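The abstract does not spell out the aggregation rule, so for intuition only, here is a generic inverse-variance (fixed-effects meta-analysis) combination of site-level estimates. This is not the paper's estimator; it only illustrates the kind of summary-level exchange that avoids sharing patient rows.

```python
import numpy as np

# Illustration only: each site k shares a summary pair (beta_k, var_k),
# never patient-level rows. This generic inverse-variance pooling is NOT
# the paper's missing-data/posterior-sample method.

def aggregate_site_estimates(betas, variances):
    """Combine per-site coefficient estimates with inverse-variance weights."""
    betas = np.asarray(betas, dtype=float)           # shape (K, p)
    variances = np.asarray(variances, dtype=float)   # shape (K, p)
    weights = 1.0 / variances
    pooled = (weights * betas).sum(axis=0) / weights.sum(axis=0)
    pooled_var = 1.0 / weights.sum(axis=0)
    return pooled, pooled_var

# Example with three hypothetical sites and two coefficients.
betas = [[0.52, -1.10], [0.47, -1.02], [0.60, -1.25]]
variances = [[0.04, 0.09], [0.05, 0.08], [0.06, 0.10]]
pooled, pooled_var = aggregate_site_estimates(betas, variances)
print(pooled, np.sqrt(pooled_var))  # pooled estimates and their standard errors
```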
Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training of deep learning models. However, the choice of the clipping norm $R$ is crucial for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune $R$ for any DP optimizer, including DP-SGD, DP-Adam, DP-LAMB, and others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters, thereby making DP training as amenable as standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it enjoys an asymptotic convergence rate matching that of standard SGD. We also demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state of the art, and can be easily employed with minimal changes to existing codebases.
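The abstract does not give the automatic rule itself. Below is a hedged sketch assuming the $R$-free variant normalizes each per-example gradient rather than clipping it, contrasted with standard clipping; treat the exact formula as an assumption, not a transcription of the paper's algorithm.

```python
import numpy as np

# Standard per-example clipping versus a normalization-style "automatic" rule.

def clip_per_example(grads, R):
    """Standard DP clipping: rescale g_i to norm at most R (R must be tuned)."""
    out = []
    for g in grads:
        norm = np.linalg.norm(g)
        out.append(g * min(1.0, R / (norm + 1e-12)))
    return np.stack(out)

def automatic_per_example(grads, gamma=0.01):
    """Assumed R-free variant: normalize each g_i, so no clipping norm to tune."""
    out = []
    for g in grads:
        out.append(g / (np.linalg.norm(g) + gamma))
    return np.stack(out)

grads = np.random.randn(8, 1000)  # 8 per-example gradients
summed_clip = clip_per_example(grads, R=1.0).sum(axis=0)
summed_auto = automatic_per_example(grads).sum(axis=0)
# In DP-SGD, Gaussian noise calibrated to the per-example bound (R or 1)
# would then be added to the summed gradient before the parameter update.
```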
Large convolutional neural networks (CNNs) can be difficult to train with differential privacy (DP), because the optimization algorithm requires a computationally expensive operation known as per-sample gradient clipping. We propose an efficient and scalable implementation of this clipping for convolutional layers, termed mixed ghost clipping, that significantly eases private training without affecting accuracy. The efficiency gains are rigorously studied through the first complexity analysis of mixed ghost clipping and existing DP training algorithms. Extensive experiments on vision classification tasks with large ResNets, VGGs, and Vision Transformers demonstrate that DP training with mixed ghost clipping adds only $1\sim10\%$ memory overhead and $<2\times$ slowdown relative to standard non-private training. Specifically, when training VGG19 on CIFAR10, mixed ghost clipping is $3\times$ faster than the state-of-the-art Opacus library with an $18\times$ larger maximum batch size. To emphasize the importance of efficient DP training on convolutional layers, we achieve 96.7\% accuracy on CIFAR10 and 83.0\% on CIFAR100 at $\epsilon=1$ using BEiT, whereas the previous best results were 94.8\% and 67.4\%, respectively. We open-source a privacy engine (\url{https://github.com/jialinmao/private_cnn}) that implements DP training of CNNs in a few lines of code.
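To give a flavor of why clipping can be made cheap, here is a sketch of the well-known "ghost norm" identity for a plain linear layer with one activation vector per example; the paper's mixed ghost clipping for convolutional layers is more involved and is not reproduced here.

```python
import torch

# For y = x @ W.T, the per-sample gradient of the loss w.r.t. W is the outer
# product of the output gradient s_i and the input activation a_i, so its
# Frobenius norm equals ||a_i|| * ||s_i||. Per-sample gradient norms can thus
# be obtained without ever materializing the per-sample gradients.

B, d_in, d_out = 16, 512, 128
a = torch.randn(B, d_in)    # layer inputs (activations), one row per example
s = torch.randn(B, d_out)   # backpropagated output gradients, per example

# Ghost norms: O(B * (d_in + d_out)) memory instead of O(B * d_in * d_out).
ghost_norms = a.norm(dim=1) * s.norm(dim=1)

# Check against the naive per-sample gradients (explicit outer products).
naive = torch.einsum("bo,bi->boi", s, a)           # shape (B, d_out, d_in)
assert torch.allclose(ghost_norms, naive.flatten(1).norm(dim=1), atol=1e-3)

# These norms feed the clipping factors min(1, R / norm) used in DP-SGD.
clip_factors = torch.clamp(1.0 / ghost_norms, max=1.0)  # with R = 1
```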
Missing data are present in most real-world problems and require careful handling to preserve prediction accuracy and statistical consistency in downstream analyses. As the gold standard for handling missing data, multiple imputation (MI) methods account for imputation uncertainty and provide proper statistical inference. In this work, we propose Multiple Imputation via Generative Adversarial Networks (MI-GAN), a deep learning-based (GAN-based) multiple imputation method that works under the missing at random (MAR) mechanism with theoretical support. MI-GAN leverages recent progress in conditional generative adversarial networks and shows strong performance, in terms of imputation error, against existing state-of-the-art imputation methods on high-dimensional datasets. In particular, MI-GAN significantly outperforms other imputation methods in terms of statistical inference and computational speed.
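A hedged sketch of the generic conditional-GAN imputation pattern the abstract alludes to; this is not MI-GAN's actual architecture (which the abstract does not specify), only the common recipe of conditioning a generator on the observed entries and the missingness mask.

```python
import torch
import torch.nn as nn

# Generic GAN-based imputation step for intuition (untrained toy generator).
d = 20
G = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, d))

x = torch.randn(32, d)                     # complete data, for illustration
mask = (torch.rand(32, d) > 0.3).float()   # 1 = observed, 0 = missing
x_obs = x * mask                           # zero out the missing entries

def impute_once(x_obs, mask):
    noise = torch.randn_like(x_obs)
    x_in = x_obs * mask + noise * (1 - mask)       # fill holes with noise
    x_gen = G(torch.cat([x_in, mask], dim=1))      # condition on the mask
    return x_obs * mask + x_gen * (1 - mask)       # keep observed values

# Multiple imputation: draw several completed datasets from the generator.
imputations = [impute_once(x_obs, mask) for _ in range(5)]
```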
In linear regression, SLOPE is a new convex analysis method that generalizes the Lasso via the sorted L1 penalty: larger fitted coefficients are penalized more heavily. This magnitude-dependent regularization requires an input penalty sequence $\lambda$, rather than a scalar penalty as in the Lasso case, which makes its design computationally very expensive. In this paper, we propose two efficient algorithms to design the possibly high-dimensional SLOPE penalty so as to minimize the mean squared error. For Gaussian data matrices, we propose a first-order projected gradient descent (PGD) under the approximate message passing regime. For general data matrices, we present a zeroth-order coordinate descent (CD) to design a subclass of SLOPE, referred to as the k-level SLOPE. Our CD allows a useful trade-off between accuracy and computational speed. We demonstrate the performance of SLOPE with our designs through extensive experiments on synthetic data and real-world datasets.
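For reference, a minimal sketch of the sorted-L1 penalty that SLOPE uses; the k-level subclass mentioned above restricts the penalty sequence $\lambda$ to take only k distinct values.

```python
import numpy as np

# Sorted-L1 (SLOPE) penalty:
#   J_lambda(beta) = sum_i lambda_i * |beta|_(i),
# where |beta|_(1) >= ... >= |beta|_(p) and lambda_1 >= ... >= lambda_p >= 0.
# With all lambda_i equal it reduces to the Lasso penalty.

def slope_penalty(beta, lam):
    """Pair the largest |beta| with the largest lambda and sum the products."""
    beta_sorted = np.sort(np.abs(beta))[::-1]   # |beta| in decreasing order
    lam = np.asarray(lam, dtype=float)          # assumed non-increasing
    return float(np.dot(lam, beta_sorted))

beta = np.array([0.1, -2.0, 0.5, 0.0])
lam = np.array([1.0, 0.7, 0.4, 0.2])            # non-increasing penalty sequence
print(slope_penalty(beta, lam))                 # 1*2.0 + 0.7*0.5 + 0.4*0.1 + 0.2*0 = 2.39
```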
We aim to bridge the gap between our common-sense, few-sample human learning and large-data machine learning. We derive a theory of human-like few-shot learning from the von Neumann-Landauer principle. Modelling human learning is difficult because how people learn varies from one person to another. Under commonly accepted definitions, we prove that all human or animal few-shot learning, and major models of such learning including the Free Energy Principle and Bayesian Program Learning, approximate our theory under the Church-Turing thesis. We find that deep generative models such as the variational autoencoder (VAE) can be used to approximate our theory and perform significantly better than baseline models, including deep neural networks, on image recognition, low-resource language processing, and character recognition.
Crowdsourcing, in which human intelligence and productivity are dynamically mobilized to tackle tasks too complex for automation alone to handle, has grown into an important research topic and inspired new businesses (e.g., Uber, Airbnb). Over the years, crowdsourcing has morphed from a platform where workers and tasks are matched up manually into one that leverages data-driven, algorithmic management approaches powered by artificial intelligence (AI) to achieve increasingly sophisticated optimization objectives. In this paper, we provide a survey presenting a unique systematic overview of how AI can empower crowdsourcing, which we refer to as AI-Empowered Crowdsourcing (AIEC). We propose a taxonomy that divides algorithmic crowdsourcing into three major areas: 1) task delegation, 2) motivating workers, and 3) quality control, focusing on the major objectives that need to be accomplished. We discuss the limitations and insights of existing work and curate the challenges of doing research in each of these areas to highlight promising future research directions.
Multiple uncertainties from power sources and loads pose significant challenges to the stable supply of various resources on islands. To address these challenges, a comprehensive scheduling framework is proposed by introducing a model-free deep reinforcement learning (DRL) approach built on a model of an island integrated energy system (IES). In response to the freshwater shortage on islands, in addition to introducing seawater desalination systems, a "hydrothermal simultaneous transmission" (HST) structure is proposed. The essence of the IES scheduling problem is the optimal combination of each unit's output, a typical sequential control problem that fits the Markov decision process framework of deep reinforcement learning. Through the interaction of the agent and the environment, deep reinforcement learning adapts to changing conditions and adjusts its strategy in a timely manner, avoiding complicated modeling and prediction of the multiple uncertainties. Simulation results show that the proposed scheduling framework properly handles the uncertainties from power sources and loads, achieves a stable supply of the various resources, and outperforms other real-time scheduling methods, especially in terms of computational efficiency. In addition, the HST model constitutes an active exploration toward improving the utilization efficiency of island freshwater.
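A generic sketch of the agent-environment scheduling loop the abstract describes; the environment below is a toy stand-in, not the paper's island IES model, and the random policy is a placeholder for a trained DRL agent.

```python
import numpy as np

# Toy Markov-decision-process loop: observe the system state, pick unit
# outputs, receive a reward, repeat. Everything here is hypothetical.

class ToyIslandEnv:
    """Hypothetical stand-in for the island integrated energy system."""
    def __init__(self, n_units=4):
        self.n_units = n_units
        self.state = np.random.rand(n_units + 2)  # e.g., loads, renewables, storage

    def step(self, action):
        # Toy reward: penalize mismatch between scheduled output and demand.
        demand = self.state[-1] * self.n_units
        reward = -abs(float(action.sum()) - demand)
        self.state = np.random.rand(self.n_units + 2)  # next (random) state
        return self.state, reward

env = ToyIslandEnv()
state = env.state
for t in range(24):                            # one scheduling day, hourly steps
    action = np.random.rand(env.n_units)       # placeholder for a DRL policy
    state, reward = env.step(action)
```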